import numpy as np
import pandas as pd
import dalex as dx
import os
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
from sklearn.preprocessing import LabelEncoder
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PolynomialFeatures
from sklearn.metrics import r2_score,mean_squared_error
from sklearn.ensemble import RandomForestRegressor
import xgboost as xgb
warnings.filterwarnings('ignore')
df = pd.read_csv('insurance.csv')
df.head()
|   | age | sex | bmi | children | smoker | region | charges |
|---|---|---|---|---|---|---|---|
| 0 | 19 | female | 27.900 | 0 | yes | southwest | 16884.92400 |
| 1 | 18 | male | 33.770 | 1 | no | southeast | 1725.55230 |
| 2 | 28 | male | 33.000 | 3 | no | southeast | 4449.46200 |
| 3 | 33 | male | 22.705 | 0 | no | northwest | 21984.47061 |
| 4 | 32 | male | 28.880 | 0 | no | northwest | 3866.85520 |
Checking for nulls.
df.isnull().sum()
age         0
sex         0
bmi         0
children    0
smoker      0
region      0
charges     0
dtype: int64
No nulls.
Encoding categorical features.
#sex
le = LabelEncoder()
le.fit(df.sex.drop_duplicates())
df.sex = le.transform(df.sex)
# smoker or not
le.fit(df.smoker.drop_duplicates())
df.smoker = le.transform(df.smoker)
#region
le.fit(df.region.drop_duplicates())
df.region = le.transform(df.region)
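For reference, `LabelEncoder` assigns integer codes in sorted label order, so it is worth recording which code means which label. A small sketch (with toy labels matching the region values) to recover the mapping:

```python
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
le.fit(['southwest', 'southeast', 'northwest', 'northeast'])
# classes_ holds the original labels in sorted order; the position is the code
mapping = dict(zip(le.classes_, le.transform(le.classes_)))
print(mapping)  # {'northeast': 0, 'northwest': 1, 'southeast': 2, 'southwest': 3}
```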
Checking correlation.
df.corr()
|   | age | sex | bmi | children | smoker | region | charges |
|---|---|---|---|---|---|---|---|
| age | 1.000000 | -0.020856 | 0.109272 | 0.042469 | -0.025019 | 0.002127 | 0.299008 |
| sex | -0.020856 | 1.000000 | 0.046371 | 0.017163 | 0.076185 | 0.004588 | 0.057292 |
| bmi | 0.109272 | 0.046371 | 1.000000 | 0.012759 | 0.003750 | 0.157566 | 0.198341 |
| children | 0.042469 | 0.017163 | 0.012759 | 1.000000 | 0.007673 | 0.016569 | 0.067998 |
| smoker | -0.025019 | 0.076185 | 0.003750 | 0.007673 | 1.000000 | -0.002181 | 0.787251 |
| region | 0.002127 | 0.004588 | 0.157566 | 0.016569 | -0.002181 | 1.000000 | -0.006208 |
| charges | 0.299008 | 0.057292 | 0.198341 | 0.067998 | 0.787251 | -0.006208 | 1.000000 |
Only smoking shows a strong correlation with charges.
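A quick way to rank features by their linear relationship with the target is to sort the charges column of the correlation matrix. A sketch with a toy frame (the first rows of the dataset) standing in for the encoded df:

```python
import pandas as pd

# toy stand-in for the encoded insurance frame; swap in the real df to
# confirm that smoker dominates the correlation with charges
df = pd.DataFrame({
    'age':     [19, 18, 28, 33, 32],
    'smoker':  [1, 0, 0, 0, 0],
    'bmi':     [27.9, 33.77, 33.0, 22.705, 28.88],
    'charges': [16884.92, 1725.55, 4449.46, 21984.47, 3866.86],
})
corr_with_target = df.corr()['charges'].drop('charges').abs().sort_values(ascending=False)
print(corr_with_target)
```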
x = df.drop(['charges'], axis = 1)
y = df.charges
x_train,x_test,y_train,y_test = train_test_split(x,y, random_state = 0)
lr = LinearRegression().fit(x_train,y_train)
y_train_pred = lr.predict(x_train)
y_test_pred = lr.predict(x_test)
print(lr.score(x_test,y_test))
0.7962732059725786
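Note that `PolynomialFeatures` is imported above but never used. A sketch (on synthetic quadratic data, since the author's polynomial pipeline is not shown) of how it combines with `LinearRegression` to capture non-linear effects that a plain linear fit misses:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# synthetic stand-in: a quadratic relationship, which plain LinearRegression underfits
rng = np.random.default_rng(0)
X = rng.uniform(18, 64, size=(200, 1))               # an age-like feature
y = 50 * (X[:, 0] - 40) ** 2 + rng.normal(0, 500, 200)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())
model.fit(X_train, y_train)
print('R2 test:', model.score(X_test, y_test))
```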
forest = RandomForestRegressor(n_estimators = 100,
                               criterion = 'squared_error',  # 'mse' was renamed in scikit-learn 1.0
                               random_state = 1,
                               n_jobs = -1)
forest.fit(x_train,y_train)
forest_train_pred = forest.predict(x_train)
forest_test_pred = forest.predict(x_test)
print('MSE train data: %.3f, MSE test data: %.3f' % (
mean_squared_error(y_train,forest_train_pred),
mean_squared_error(y_test,forest_test_pred)))
print('R2 train data: %.3f, R2 test data: %.3f' % (
r2_score(y_train,forest_train_pred),
r2_score(y_test,forest_test_pred)))
MSE train data: 3729086.094, MSE test data: 19933823.142
R2 train data: 0.974, R2 test data: 0.873
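The forest also exposes `feature_importances_`, a global counterpart to the per-instance profiles below. A sketch on synthetic data where one column dominates the target, much as smoker dominates charges here:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# synthetic stand-in: the 'smoker' column drives the target
rng = np.random.default_rng(1)
X = pd.DataFrame(rng.normal(size=(300, 3)), columns=['smoker', 'age', 'bmi'])
y = 10 * X['smoker'] + X['age'] + rng.normal(0, 0.1, 300)

forest = RandomForestRegressor(n_estimators=100, random_state=1).fit(X, y)
importances = pd.Series(forest.feature_importances_, index=X.columns).sort_values(ascending=False)
print(importances)  # 'smoker' should rank first by a wide margin
```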
XGB = xgb.XGBRegressor()
XGB.fit(x_train,y_train)
XGB_train_pred = XGB.predict(x_train)
XGB_test_pred = XGB.predict(x_test)
print('MSE train data: %.3f, MSE test data: %.3f' % (
mean_squared_error(y_train,XGB_train_pred),
mean_squared_error(y_test,XGB_test_pred)))
print('R2 train data: %.3f, R2 test data: %.3f' % (
r2_score(y_train,XGB_train_pred),
r2_score(y_test,XGB_test_pred)))
MSE train data: 674528.731, MSE test data: 24561658.787
R2 train data: 0.995, R2 test data: 0.844
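The large train/test gaps above suggest both tree models overfit, and a single split can flatter one model by chance. A hedged sketch of comparing models with `cross_val_score` instead (synthetic data, and sklearn models only, to avoid an xgboost dependency):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# synthetic nonlinear data; cross-validation averages R2 over several splits
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4))
y = X[:, 0] ** 2 + 2 * X[:, 1] + rng.normal(0, 0.1, 300)

results = {}
for model in (LinearRegression(), RandomForestRegressor(n_estimators=100, random_state=1)):
    scores = cross_val_score(model, X, y, cv=5, scoring='r2')
    results[type(model).__name__] = scores.mean()
    print('%s: mean R2 = %.3f (+/- %.3f)' % (type(model).__name__, scores.mean(), scores.std()))
```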
idx = 300
slct = x_train.iloc[[idx]]
print("LinearRegression:")
print(lr.predict(slct))
print("Forest:")
print(forest.predict(slct))
print("XGB:")
print(XGB.predict(slct))
print("correct:")
print(y_train.iloc[[idx]])
LinearRegression:
[37516.36193715]
Forest:
[54945.3837506]
XGB:
[59010.766]
correct:
1230    60021.39897
Name: charges, dtype: float64
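Using the predictions printed above, the relative error of each model on this instance can be computed directly:

```python
# relative error of each model on the selected training instance;
# the numbers are copied from the predictions printed above
y_true = 60021.39897
preds = {
    'LinearRegression': 37516.36193715,
    'Forest':           54945.3837506,
    'XGB':              59010.766,
}
errors = {name: 100 * abs(yhat - y_true) / y_true for name, yhat in preds.items()}
for name, err in errors.items():
    print('%s: %.1f%% off' % (name, err))  # 37.5%, 8.5%, 1.7% respectively
```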
explainer = dx.Explainer(lr,
data = x_test,
y = y_test)
explainer_forest = dx.Explainer(forest,
data = x_test,
y = y_test)
explainer_XGB = dx.Explainer(XGB,
data = x_test,
y = y_test)
Preparation of a new explainer is initiated
  -> data              : 335 rows 6 cols
  -> target variable   : Parameter 'y' was a pandas.Series. Converted to a numpy.ndarray.
  -> target variable   : 335 values
  -> model_class       : sklearn.linear_model._base.LinearRegression (default)
  -> label             : Not specified, model's class short name will be used. (default)
  -> predict function  : <function yhat_default at 0x0000024237F07D30> will be used (default)
  -> predict function  : Accepts pandas.DataFrame and numpy.ndarray.
  -> predicted values  : min = 1.86e+02, mean = 1.34e+04, max = 4.02e+04
  -> model type        : regression will be used (default)
  -> residual function : difference between y and yhat (default)
  -> residuals         : min = -1.09e+04, mean = -11.0, max = 2.2e+04
  -> model_info        : package sklearn
A new explainer has been created!

Preparation of a new explainer is initiated
  -> data              : 335 rows 6 cols
  -> target variable   : Parameter 'y' was a pandas.Series. Converted to a numpy.ndarray.
  -> target variable   : 335 values
  -> model_class       : sklearn.ensemble._forest.RandomForestRegressor (default)
  -> label             : Not specified, model's class short name will be used. (default)
  -> predict function  : <function yhat_default at 0x0000024237F07D30> will be used (default)
  -> predict function  : Accepts pandas.DataFrame and numpy.ndarray.
  -> predicted values  : min = 1.24e+03, mean = 1.43e+04, max = 5.1e+04
  -> model type        : regression will be used (default)
  -> residual function : difference between y and yhat (default)
  -> residuals         : min = -2.07e+04, mean = -8.88e+02, max = 2.12e+04
  -> model_info        : package sklearn
A new explainer has been created!

Preparation of a new explainer is initiated
  -> data              : 335 rows 6 cols
  -> target variable   : Parameter 'y' was a pandas.Series. Converted to a numpy.ndarray.
  -> target variable   : 335 values
  -> model_class       : xgboost.sklearn.XGBRegressor (default)
  -> label             : Not specified, model's class short name will be used. (default)
  -> predict function  : <function yhat_default at 0x0000024237F07D30> will be used (default)
  -> predict function  : Accepts pandas.DataFrame and numpy.ndarray.
  -> predicted values  : min = 2.98e+02, mean = 1.42e+04, max = 4.92e+04
  -> model type        : regression will be used (default)
  -> residual function : difference between y and yhat (default)
  -> residuals         : min = -2.36e+04, mean = -7.18e+02, max = 2.09e+04
  -> model_info        : package xgboost
A new explainer has been created!
lr_profile = explainer.predict_profile(slct)
lr_profile.plot(variables = ['age', 'sex', 'bmi', 'children','region','smoker'])
Calculating ceteris paribus: 100%|██████████████████████████████████████████████████████| 6/6 [00:00<00:00, 213.16it/s]
As the CP profiles show, the features with a significant impact on the prediction are age, BMI and smoker. The smoker profile has the steepest slope, which suggests that this feature has the biggest impact on the prediction.
forest_profile = explainer_forest.predict_profile(slct)
forest_profile.plot(variables = ['age', 'sex', 'bmi', 'children','region','smoker'])
Calculating ceteris paribus: 100%|███████████████████████████████████████████████████████| 6/6 [00:00<00:00, 50.69it/s]
The main conclusions are the same as for the linear model. Additionally, we can see that the predicted charges increase sharply at BMI = 30. There are also more irregularities in the plots, which could not appear in the first example because of the different nature of the linear model. Moreover, it is easier to distinguish continuous from discrete features.
XGB_profile = explainer_XGB.predict_profile(slct)
XGB_profile.plot(variables = ['age', 'sex', 'bmi', 'children','region','smoker'])
Calculating ceteris paribus: 100%|██████████████████████████████████████████████████████| 6/6 [00:00<00:00, 130.43it/s]
The differences between the second and third models are much smaller than between the first and the second. The jump in the prediction at BMI = 30 is still visible. There are even more irregularities in the plots than in the second example, likely because the third model is more complex than the second. Additionally, for the first time we can see different predictions based on sex.
To summarize: across all three models, the CP profiles identify age, BMI and smoker as the features with a significant impact on the prediction. The smoker profile is the steepest, suggesting the biggest impact, while sex and region appear to have little to no effect. The tree-based models additionally capture the sharp increase in predicted charges at BMI = 30. Accuracy on the selected instance also improves with model complexity: the linear model's prediction is about 40% off the correct value, the random forest is about 8% off, and XGBoost is only about 1.5% off.